{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Iris flower dataset\n",
    "============\n",
    "\n",
    "In this tutorial we will see how to pre-process and prepare the Iris dataset. The Iris flower dataset or Fisher's Iris dataset is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. The data set contains three classes (Setosa, Versicolour, Virginica) of 50 instances each, where each class refers to a type of iris plant. One class (Setosa) is linearly separable from the other two (Versicolour and Virginica), whereas the latter are not linearly separable from each other.\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/iris_comparison.png\" width=\"750\">\n",
    "</p>\n",
    "\n",
    "There is no need to download the dataset. The CSV file called `iris_dataset.csv` included in this repository contains all the 150 features for the three classes. Each line of the CSV file contains the following information: sepal length, sepal width, petal length, petal width, class (0=Setosa, 1=Versicolor, 2=Virginica). The legnth and width is reported in centimeters. The difference between petal and sepal is showed in the following image:\n",
    "\n",
    "<p align=\"center\">\n",
    "<img src=\"../etc/img/iris_petal_sepal.png\" width=\"200\">\n",
    "</p>\n",
    "\n",
    "In the next sessions I am going to show you how to pre-process the dataset using numpy and tensorflow, and how to store the dataset in a TFRecord file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-processing in Numpy\n",
    "---------------------------\n",
    "\n",
    "First of all I will show you how to load the dataset and convert it to a Numpy array. That is easy thanks to the built-in `numpy.loadtxt()` method, that directly open a text file and convert the characters to numerical values. Here I shuffle the dataset and split it in two portions: test (33%) and training (66%)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "np_dataset = np.loadtxt(open(\"./iris_dataset.csv\", \"rb\"), delimiter=\",\", skiprows=1)\n",
    "np.random.shuffle(np_dataset) #shuffle the dataset rows\n",
    "train_features_array = np_dataset[0:100, 0:4]\n",
    "train_labels_array = np_dataset[0:100, 4:]\n",
    "test_features_array = np_dataset[100:150, 0:4]\n",
    "test_labels_array = np_dataset[100:150, 4:]\n",
    "\n",
    "print(train_features_array.shape)\n",
    "print(train_labels_array.shape)\n",
    "print(test_features_array.shape)\n",
    "print(test_labels_array.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is interesting to see if the dataset is **linearly separable**. We can use [Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) to project the dataset in a two dimensional plane and plot it. I will use different colours for the various classes (red=setosa, green=versicolour, blue=virginica). If we can use a line to clearly separate some of the classes it means that they are linearly separable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "np_dataset = np.loadtxt(open(\"./iris_dataset.csv\", \"rb\"), delimiter=\",\", skiprows=1)\n",
    "np_dataset = np_dataset[:, 0:4] #I do not care about the labels\n",
    "\n",
    "covariance_matrix = np.cov(np_dataset, rowvar=False)\n",
    "eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix) #compute eigenvalues and right eigenvectors \n",
    "eigen_vectors = eigen_vectors[:, 0:2] #taking only two dimensions\n",
    "data = np.dot(np_dataset, eigen_vectors) #apply the transformation and get the scaled data\n",
    "\n",
    "fig = plt.figure()\n",
    "ax1 = fig.add_subplot(111)\n",
    "ax1.plot(data[0:50, 0], data[0:50, 1], '.', color='red') #setosa\n",
    "ax1.plot(data[50:100, 0], data[50:100, 1], '.', color='green') #versicolour\n",
    "ax1.plot(data[100:150, 0], data[100:150, 1], '.', color='blue') #virginica\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the setosa class is linearly separable from the other two classes. In this case we could collaps the second and third class in a single one and use the dataset as benchmark for linear classifiers (e.g. Perceptron)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pre-processing in Tensorflow\n",
    "---------------------------------\n",
    "\n",
    "We can easily use Tensorflow to load the dataset and then save it as a **TFRecord**. We need a function to convert the numpy arrays to the TFRecord format. This can easilly be done. Then the numpy dataset is loaded and using the function is converted to TFRecords and stored on disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def numpy_to_tfrecord(features_array, labels_array, output_file):\n",
    "    with tf.python_io.TFRecordWriter(output_file) as record_writer:\n",
    "        for i in range(labels_array.shape[0]):\n",
    "            #Getting the data as train feature \n",
    "            float_feature = tf.train.Feature(float_list=tf.train.FloatList(value=features_array[i].tolist()))\n",
    "            int64_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[labels_array[i]]))\n",
    "            #Stuff the data in an Example buffer\n",
    "            example = tf.train.Example(features=tf.train.Features(feature={'feature': float_feature,\n",
    "                                                                           'label': int64_feature}))\n",
    "            #Serialize example to string and write in tfrecords\n",
    "            record_writer.write(example.SerializeToString())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np_dataset = np.loadtxt(open(\"./iris_dataset.csv\", \"rb\"), delimiter=\",\", skiprows=1)\n",
    "np.random.shuffle(np_dataset) #shuffle the dataset rows\n",
    "train_features_array = np_dataset[0:100, 0:4]\n",
    "train_labels_array = np_dataset[0:100, 4:]\n",
    "test_features_array = np_dataset[100:150, 0:4]\n",
    "test_labels_array = np_dataset[100:150, 4:]\n",
    "\n",
    "print(train_features_array.shape)\n",
    "print(train_labels_array.shape)\n",
    "print(test_features_array.shape)\n",
    "print(test_labels_array.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "numpy_to_tfrecord(train_features_array, train_labels_array, \"./iris_train.tfrecord\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "numpy_to_tfrecord(test_features_array, test_labels_array, \"./iris_test.tfrecord\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the TFRecord files in a session\n",
    "------------------------------------------\n",
    "\n",
    "The TFRecords file have been stored on disk and can be used to train a model in a session. The best way to consume TFRecords is through the Dataset class in Tensorflow. We can load the TFRecords, parse them, and then declare a Dataset object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def _parse_function(example_proto):\n",
    "    features = {\"feature\": tf.VarLenFeature(tf.float32),\n",
    "                \"label\": tf.FixedLenFeature((), tf.int64, default_value=0)}\n",
    "    parsed_features = tf.parse_single_example(example_proto, features)\n",
    "    feature = tf.cast(parsed_features[\"feature\"], tf.float32)\n",
    "    feature = tf.sparse_tensor_to_dense(feature, default_value=0)\n",
    "    label_one_hot = tf.one_hot(parsed_features[\"label\"], depth=3)\n",
    "    return feature, label_one_hot\n",
    "\n",
    "print \"Loading the datasets...\"\n",
    "tf_train_dataset = tf.data.TFRecordDataset(\"./iris_train.tfrecord\")\n",
    "print \"Parsing the datasets...\"\n",
    "tf_train_dataset = tf_train_dataset.map(_parse_function)\n",
    "print \"Verifying types and shapes...\"\n",
    "print(tf_train_dataset.output_types)\n",
    "print(tf_train_dataset.output_shapes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with tf.name_scope('dataset'):\n",
    "    batch_size = 1\n",
    "    num_epochs = 5\n",
    "    tf_train_dataset = tf_train_dataset.batch(batch_size)\n",
    "    tf_train_dataset = tf_train_dataset.repeat(num_epochs)\n",
    "    iterator = tf_train_dataset.make_one_shot_iterator()\n",
    "    next_batch_features, next_batch_labels = iterator.get_next()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset can be used in a session asking for the features and labels we are interested in. Here I will ask for a new pair of features-labels and I will print the content. You can check in the CSV file and verify that everything is correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sess = tf.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "features, labels = sess.run([next_batch_features, next_batch_labels])\n",
    "print features\n",
    "print labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conclusions\n",
    "--------------\n",
    "\n",
    "In this tutorial we saw how to load load and use the iris dataset and how to store it in a TFRecord file. The dataset is lightweight and can be used for basic classifiers.\n",
    "\n",
    "**Copyright (c)** 2018 Massimiliano Patacchiola, MIT License"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
